Pangenome-wide association study reveals the selective absence of CRISPR genes (Rv2816c-19c) in drug-resistant Mycobacterium tuberculosis

ABSTRACT The presence of intermittently dispersed insertion sequences and transposases in the Mycobacterium tuberculosis (Mtb) genome makes intra-genome recombination events inevitable. Understanding their effect on the gene repertoires (GR), which may contribute to the development of drug-resistant Mtb, is critical. In this study, publicly available WGS data of clinical Mtb isolates (endemic region n = 2,601; non-endemic region n = 1,130) were de novo assembled, filtered, scaffolded into assemblies, and functionally annotated. Out of 2,601 Mtb WGS data sets from endemic regions, 2,184 (drug resistant/sensitive: 1,386/798) qualified as high quality. We identified 3,784 core genes, 123 softcore genes, 224 shell genes, and 762 cloud genes in the pangenome of Mtb clinical isolates from endemic regions. Sets of 33 and 39 genes showed positive and negative associations (P < 0.01) with drug resistance status, respectively. Gene ontology clustering showed compromised immunity to phages and impaired DNA repair in drug-resistant Mtb clinical isolates compared to the sensitive ones. Multidrug efflux pump repressor genes (Rv3830c and Rv3855c) and CRISPR genes (Rv2816c-19c) were absent in the drug-resistant Mtb. A separate WGS data analysis of drug-resistant Mtb clinical isolates from the Netherlands (n = 1130) also showed the absence of CRISPR genes (Rv2816c-17c). This study highlights the role of CRISPR genes in drug resistance development in Mtb clinical isolates and helps in understanding its evolutionary trajectory and as useful targets for diagnostics development. IMPORTANCE The results from the present Pan-GWAS study comparing gene sets in drug-resistant and drug-sensitive Mtb clinical isolates revealed intricate presence-absence patterns of genes encoding DNA-binding proteins having gene regulatory as well as DNA modification and DNA repair roles. Apart from the genes with known functions, some uncharacterized and hypothetical genes that seem to have a potential role in drug resistance development in Mtb were identified. We have been able to extrapolate many findings of the present study with the existing literature on the molecular aspects of drug-resistant Mtb, further strengthening the relevance of the results presented in this study.

• Upload point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT in your cover letter.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Upload a clean .DOC/.DOCX version of the revised manuscript and remove the previous version.
• Each figure must be uploaded as a separate, editable, high-resolution file (TIFF or EPS preferred), and any multipanel figures must be assembled into one file.
• Any supplemental material intended for posting by ASM should be uploaded with their legends separate from the main manuscript.You can combine all supplemental material into one file (preferred) or split it into a maximum of 10 files with all associated legends included.
For complete guidelines on revision requirements, see our Submission and Review Process webpage.Submission of a paper that does not conform to guidelines may delay acceptance of your manuscript.

Importance
The results from the present Pan-GWAS study comparing gene sets in drug-resistant and drug-sensitive Mtb isolates revealed intricate presence-absence patterns of genes encoding DNA binding proteins having gene regulatory as well as DNA modification and DNA repair roles.Apart from the genes with known functions, we have identified some Uncharacterized and Hypothetical genes that seem to have a significant role in drug resistance development in Mtb.We have been able to extrapolate many findings of the present study with the existing literature on drug-resistant Mtb isolates, further strengthening the relevance of the results presented in this study.This study contributes to the foundation of future research on the drug resistance development in Mtb.

Introduction:
According to the Global Tuberculosis Report 2023 by the World Health Organization (WHO), approximately 410,000 new drug-resistant Tuberculosis (TB) cases were reported in 2022.TB has a higher prevalence in tropical regions, such as South Asia and Africa, that mainly comprise poverty-stricken developing nations, compared to other parts of the world.
The causative agent of TB, i.e., Mycobacterium tuberculosis (Mtb), is a slow-growing pathogen that mostly causes lower respiratory tract infections.Mtb has several intrinsic drug resistance-conferring factors including thick cell wall, lipid-rich cell membrane, druginactivating enzymes and drug target modification systems.Host-dependent selective pressures such as incompliance, inappropriate dose of antibiotics, and factors like delayed diagnosis lead to the accumulation of mutations in specific Mtb genes that result in ineffective drug action (1,2,3,4).Despite the introduction of several novel and repurposed drugs for the treatment of drug-resistant TB, unresponsiveness to new therapeutics is increasingly being reported (5).Mtb's slow growth rate, ability to remain dormant for decades, development of drug tolerance, and drug resistance also contribute to the relapse of TB.
Drug resistance in Mtb is primarily caused by the acquisition of Single Nucleotide Polymorphisms (SNPs) in the drug target genes, prodrug-activating genes, their promoter regions, and other genes that are involved in the mechanism of action of respective drugs.
Unlike in other bacteria, Horizontal Gene Transfer (HGT) of the drug resistance-conferring plasmids does not contribute to the development of drug resistance in Mtb (6)(7)(8).However, Mtb is susceptible to infection by Mycobacteriophages and can also undergo natural intragenome recombination events (9,10).In a recent study, the insertion sequence IS6110 that encodes a transposase is found to actively take part in transposition, thereby leading to genetic deletions in an observable time frame of one year.IS6110 is also reportedly more abundant in Lineage 2 (Beijing) Mtb isolates, which are widely known to be highly drugresistant (11,12).Events like gene deletions can directly contribute to the emergence of drug resistance or indirectly by improving the fitness.
Genome-Wide Association Studies (GWAS) have been extensively used to identify drug resistance-associated SNPs in Mtb, but its application at the pangenome level is limited (13).
Existing literature on Pan-GWAS of Mtb seems to be primarily focused on identifying the genetic determinants taking part in higher prevalence, the site of infection (Extrapulmonary/Pulmonary) and those forming inter-species diversity (14)(15)(16).A Pan-GWAS study from 2018 identified 24 novel genetic signatures associated with drug resistance using a sample set of 1595 from varying geographical regions (17).Another Pan-GWAS study, also from 2018, was more focused on understanding the causation of atypical drug resistance in Mtb isolates.In this, several unique genes, as well as intergenic regions, are found to be exclusively associated with atypical drug resistance in Mtb compared to 145 other isolates with typical drug-resistant markers (18).
Given the limited literature on the pangenome of drug-resistant Mtb isolates and our understanding of genomic structural variations that may arise upon drug resistance development, we aimed to identify unique gene repertoires by analysing a publicly available Mtb WGS data set from TB endemic countries.The identified gene sets in the present study might be useful in developing region-specific diagnostic tools, identifying novel drug targets and understanding the host-pathogen interactions of drug-resistant Mtb isolates.

Statistical analysis:
Benjamin-Hochberg (BH) test was applied for determining the association of genes to drug-sensitive and drug resistant Mtb isolates using Scoary v1.6.16.(32).The genes qualifying the criteria of the BH adjusted p-value (< 0.01) and Log 10 Odds ratio (> 0.5 or < -0.5) were considered to have significantly perturbed association either with drug-resistance or drug-sensitive isolates.GraphPad Prism v8.0.2. and MS Excel (2016 home edition) were used for data analysis and representation.
Finally, a set of 2184 samples, qualified all data filtering steps (DR profiling and strain mixing determination, Metagenome detection, De novo QC, and Md5Checksum-based de-duplication of annotated genomes) (Figure 1B).We considered this data set as high-quality and subsequently used for Pan-GWAS.
Phylogenetic analysis grouped these samples into 4 separate clades corresponding to the four lineage (Figure 2A).

Pan-GWAS analysis reveals gene repertoire differences in drug-resistant and sensitive
Mtb isolates: In total, a set of 4893 genes were found in these 2184 high-quality annotated Mtb genomes (Figure 2B).Gene repertoire analysis of the qualified sample set using Panaroo showed 3784 core genes (present in >99 % of isolates), 123 softcore genes (present in 95-98% of isolates), 224 shell genes (present in 15-94% of isolates), and 762 genes (present in >0-14% isolates).Click or tap here to enter text.
A set of 187 genes was found to be significantly associated with either drug resistance or drug sensitivity status (Benjamin Hochberg adjusted p-value < 0.01).Amongst these, 115 genes aligned to multiple regions of Mtb H37Rv because of their sequence redundancy and were excluded from further analysis.Out of the rest 72 genes that showed a single hit after pairwise alignment, 39 genes showed a negative association with drug resistance (absent in drug-resistant isolates), while 33 genes showed a positive association (present in drug resistant isolates).
Apart from the genes with extreme Log 10 OR association scores, we observed negative and positive association with drug resistance status of CRISPR-genes (Rv2816c-19c) and Toxin-Antitoxin genes (Rv2231A: VapC16 toxin and Rv2653c-54c), respectively, with high significance.The genes with significant association either with drug resistance or drug sensitivity status are shown in Table 1A and 1B.
Gene Ontology analysis: DAVID gene ontology (GO) clustering of the genes showed enrichment of biological processes viz., antiviral defense (GO 0051607 and KW0051), endonuclease activity (KW0255), nuclease activity (KW0540), and Hydrolases (KW0378) drug-sensitive Mtb isolates (Figure 3A).This indicates that the bacterial immune system is compromised in most drug-resistant Mtb isolates.
GO analysis also revealed presence of DNA binding domains (viz., HTH and HNH) in 17 out of 72 genes with perturbed association potentially taking part in DNA modification or gene regulation.Out of 17 genes with DNA binding domains, 12 genes had had Nuclease, Primase, and Helicase activities that can take active part in DNA modification.Out of 12 genes encoding DNA-modifying enzymes, 10 genes forming three genetic islands were observed.

Discussion:
Understanding the emergence of drug resistance in Mtb is quintessential to predict its future evolutionary path.Under selective pressure, Mtb acquires mutations in specific genes that contribute to the drug resistance development (3).These mutations are used as markers for diagnosing the drug resistance status of clinical Mtb isolates and are detected by molecular tools.Many databases such as TBDreamDB, WHO mutation catalogue and PhyResSE provide the details of such drug resistance associated mutations (33)(34)(35).Apart from these mutations, larger deletions have the potential to contribute to the development of drug resistance emergence in Mtb.These deletions may occur upon Mycobacteriophage-Mtb encounters and intragenomic recombination events (9,10,36).The Mtb genome houses a plethora of intermittently as well as continuously dispersed repetitive sequences such as insertion sequences, direct long and short repeats.These redundant genetic elements of Mtb are analyzed by various molecular typing tools such as IS6110-RFLP, Spoligotyping, MIRU-VNTR, and PGRS-RFLP (36).These redundant genetic elements could be present in close proximities and participate genetic rearrangements and deletions in specific Mtb isolates (37)(38)(39).Mtb genome also houses genes such as IS6110 (insertion sequence and transposase), Rv3798 (probable transposase as per Mycobrowser) and RD1 region having genes with transposase-like activity (40).Factors like Mycobacteriophages-Mtb encounters, presence of redundant genetic elements and transposase genes make Mtb vulnerable to natural recombination events that can cause gene duplications and partial or complete deletion of genetic islands consisting of more than one non-essential gene.Many of these genetic islands serve as regions of differences (RD) and aid in the identification of infecting strains in surveillance and epidemiological studies and are also involved in conferring virulence in Mtb (41).Specific genes such as those that form membrane proteins and Secretome (for example, RD155 genes including eccCb1, PE35, esxB, esxA, eccD1, espK, and P1cA-C) are reported to be associated with higher virulence in Mtb (41,42).Based on these factors, we speculated that the genes with differential presence-absence patterns can potentially provide clues to understanding the evolution and emergence of drug resistance in Mtb.
In the present study we conducted Pan-GWAS using publicly available WGS data of Mtb clinical isolates reported from TB endemic countries (India, Pakistan, Zambia and China).
The presence of non-Mtb metagenomes in unprocessed WGS data can introduce foreign genes and affect the accuracy of the Pan-GWAS results so samples having signatures of non-Mtb metagenomes were excluded.The WGS dataset (n=2184) which had high de-novo assembly quality and were non-duplicates was used for Pan-GWAS (Figure 1).Majority (>95%) of the lineage 2 samples showed drug-resistant profiles (Figure 1D, S1-A) corroborating earlier reports (43)(44)(45).Gene repertoire analysis identified 3784 genes that form the core Mtb genome, which is within the reported range (15).
After computation of Log 10 OR values to determine the direction of association of specific genes having high significance (p-value < 0.01), we observed many intricacies that can be further investigated to understand the mechanistic aspects of drug resistance development as well as associated virulence.For example, the observed pattern in phage (negatively associated with drug resistance) and prophage proteins (positively associated with drug resistance) in this study in Mtb isolates may be responsible for the previously known complex relationship between virulence and drug resistance observed in various bacteria, including in Mtb (46).We also observed positive association of Toxin-Antitoxin genes (Rv2231A, Rv2653c and Rv2654c) with drug resistance.These Mtb genes are reported to be involved in the induction of dormancy, drug tolerance and formation of persister Mtb population (47).
The Mtb genes with identical presence-absence patterns in certain genetic islands indicates that the constituting genes are perhaps acquired simultaneously in now-extinct ancestral Mycobacterium prototuberculosis through historical HGT or phage infection events.These genes were subsequently found to be co-expressing (StringDB).
Upon analyzing the highly significant (p-value < 0.01) genes having extreme Log 10 OR scores, strong corroboration with existing literature was found.Rv3855, Rv2765, Rv3830c, Rv0071, Rv0072 and Rv0073 were found as top 6 genes with most negative association scores which means that they are absent in significant proportion of drug resistant isolates.
Rv3855 (Log 10 OR=-2.1,HTH-type transcriptional repressor EthR), a repressor of ethionamide-activating gene Rv3854 and its absence in drug resistant isolates might not directly cause ethionamide drug resistance but could increase the tolerance to ethionamide.
This tolerance could offer a survival advantage, potentially contributing to the evolution of drug resistance like phenotype (48).Mutations in the upstream region of, Rv2765 (Log 10 OR=-1.6,Hydrolase) are previously known to be associated with ethambutol drug resistance (49).Absence of Rv3830c (Log 10 OR=-1.4,TetR family transcriptional regulator), which is a repressor that binds to the regulatory region of various multidrug efflux pumps can potentially cause overexpression of its target multidrug efflux pump genes in the drug resistant Mtb isolates (50).Rv0071 (Log 10 OR=-1.4,Maturase), Rv0072 (Log 10 OR=1.3, Glutamine ABC transporter permease) and Rv0073 (Log 10 OR=-1.4,Glutamine ABC transporter ATP binding protein) constitute RD105 as per RDScan database (51).The deletion of RD105 is already known to be associated with drug resistance development to multiple anti-TB drugs corroborating our findings (52).
Similarly, Rv3919c, Rv1844c, Rv1355c, Rv3383c, and Rv1371 were found to have the most positive Log 10 OR scores indicating a positive association with drug-resistant Mtb isolates, which means that these genes are present in a significant number of drug-resistant isolates.
Rv3919c (Log 10 OR=+2.2, 16S rRNA (guanine(527)-N( 7))-methyltransferase RsmG), is previously known to be associated with low-level resistance to streptomycin (53).Rv1844c (Log 10 OR=+1.79,6-Phosphogluconate dehydrogenase), is also previously known to be associated with Isoniazid resistance (54,55).Rv1355c (Log 10 OR=+1.4,Molybdopterin biosynthesis protein) showed positive association with drug resistance.Rv1355c and the downstream gene Rv1356c are co-expressing genes (as per StringDB).Rv1356c is already known to have significantly lower expression in drug-sensitive Mtb isolates (56).Based on these factors as well as their tandem occurrence on the genome, Rv1355c and Rv1356c can be speculated to have a relationship like that of a genetic island and an operon.Rv3383c (Log 10 OR=+1.3,Polyprenyl synthetase IdsB), is known to have pyrazinamide resistanceassociated mutation hotspots (57).A GWAS study previously revealed an association of Rv1371 (Log 10 OR=+1.3, a membrane protein) with resistance to drugs of multiple classes, including Ethambutol, injectables, Ethionamide, Delamanid and Linezolid, corroborating with our findings (58).
Important to highlight is the seemingly critical role of CRISPR-associated genes (Log 10 OR=-1.3 to -1.04) that showed negative association with drug resistance.Absence of Rv2816c-17c is most Lineage 2 Mtb isolates is previously known and that its absence can have a cumulative effect on the impaired DNA repair in drug-resistant Mtb isolates (59).Also, overexpression of Rv2816c is previously known to decrease the drug susceptibility in M.
smegmatis whereas Rv2816c-knockout strains are also known to impart drug tolerance and drug resistance-like phenotype (59)(60)(61)(62).Wang et al. recently discussed the association of the absent CRISPR system with drug resistance in Klebsiella pneumoniae (63).Similarly, its absence is also known to be associated with drug resistance in Shigella (64).Earlier reports showed that the upstream region of Rv2816c-17c-18c-19c comprises insertion sequences and direct repeats, which are vulnerable to insertion sequence-driven genome deletions (65).
Many genes including the hypothetical ones showed strong association with drug resistance development in Mtb.These genes may not have a direct causation of drug resistance, therefore, elucidating their roles with respect to the emergence drug resistance can potentially enrich our understanding of drug resistance evolution in Mtb.

Figure 1 :
Figure 1: Whole Genome Sequencing (WGS) data filtering and population structure of clinical Mycobacterial tuberculosis isolates included in the study.A: Workflow employed to analyse publicly available WGS data.The data analysis consisted of data acquisition, assembling, DR profiling, lineage determination, metagenome detection, scaffolding, de novo assembly QC, functional annotation, and removal of duplicate GFF files.2601 (m) samples were downloaded for data analysis.Sample in the sets k 1-4 successfully underwent preprocessing, and showed high quality on various metrics.Upon Venn analysis, 2184 samples were found common in k 1-4 and were subsequently subjected to gene repertoire analysis and Pan-GWAS (B).The high-quality sample set (n=2184) consisted of Mtb isolates from 4 TBendemic countries (India, Pakistan, Zambia and China) (C) with varying drug resistance profiles and lineages (D and E).Abbreviations: DR: Drug Resistant; m: the total number of

Table 1 : To identify significant association with drug resistance and drug sensitivity, the genes were filtered based on BH-adjusted p-value (<0.01) and Log 10 OR (>0.5 and < - 0.5).
samples downloaded from NCBI-SRA; k 1 : number of samples passing >95% of reads aligning with MTBC species; k 2 : number of samples having successfully characterized with no evidence of mixed lineages; k 3 : number of samples that passed de novo quality metrics; k 4 : number of samples passing deduplication based on MD5checksum of GFF files.The tables show genes (locus tag), their products, and the value of Log 10 OR, indicating the direction of association (Resistance and Sensitivity).Positive Log 10 OR scores indicate positive association with drug resistance (present in drug resistant isolates) and negative Log 10 OR scores indicate negative association with drug resistance (present in drug sensitive isolates).The loci in red text form RD as per the RDScan database.A: Genes having a negative association; B: Genes having a positive association with drug resistance.